The ABC of Computational Text Analysis

#1 Introduction +
Where is the digital revolution?

Alex Flückiger

Faculty of Humanities and Social Sciences
University of Lucerne

22 February 2024

Outline

  1. digital revolution or hype?
  2. about us
  3. goals of this course

AI: A non-standard introduction

The world has changed, hasn’t it?

An era of Big Data + AI

Group discussion

What makes a computer look intelligent?

AI is a moving target with respect to …

  • human capabilities
  • technological abilities

Transfer of Human Intelligence

from static machines to more flexible devices

  • mimicking intelligent behavior
    • perception: reading + seeing + hearing
    • generation: speaking + writing + drawing
  • contextual adaptation
  • many degrees of freedom

Seeing like a Human?

Image segmentation by Facebook’s Detectron2 (Wu et al. 2019)

Hearing like a Human?

Speech-to-Text (S2T)

Recognizing speech robustly (e.g. across languages, accents, noise)

Speaking like a Human?

Speech-to-Speech (S2S)

Synthesizing generic or personal voice

Outsmarting Humans?

Debunk some myths around ChatGPT

  • is a brand; large language models (LLMs) are the underlying technology

  • generates fluent text, not necessarily truthful

  • is highly useful, although it understands little

  • is English-focused, multilinguality is limited

  • generates non-reproducible outputs

  • generated text cannot be reliably detected (except verbatim parts)

  • yesterday’s version may differ from today’s

Where does the smartness come from?

Number of words exposed (Timiryasov and Tastet 2023)

  • ~100’000’000 for a typical 13-year-old kid
  • ~4’300’000’000 words in entire Wikipedia
  • >1’000’000’000’000 for current LLMs 🤯
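
Taking the slide’s orders of magnitude at face value, a quick back-of-the-envelope comparison (the figures are rough estimates, not measurements):

```python
# Rough comparison of training exposure; magnitudes taken from the slide above.
kid_words = 100_000_000            # ~words a typical 13-year-old has been exposed to
wikipedia_words = 4_300_000_000    # ~words in the entire Wikipedia
llm_words = 1_000_000_000_000      # >words current LLMs are trained on

print(f"LLM vs. kid:       {llm_words // kid_words:,}x")
print(f"LLM vs. Wikipedia: {llm_words / wikipedia_words:.0f}x")
```

Current LLMs thus see roughly ten thousand times more text than a teenager ever has.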

An LLM is amazing but …

… it is also a stochastic parrot. 🦜

(Bender et al. 2021)

LLMs are a tool…

… learn how to use it 👍

  • use as interactive partner
  • don’t trust; try, refine and develop understanding
  • speed up tasks, yet blind automation is not feasible

These people do not exist

Generated Images by a Neural Network (Karras et al. 2020)

Faces generated by StyleGAN. Generate more faces!

Multimodality and guidance

Guided generation of text, audio, images, video

Breakthrough by combining language processing and image generation with Muse (Chang et al. 2023)

Interact with images using text prompts

Editing pictures with Muse using natural language (Chang et al. 2023)

Erase or edit reality

For your Instagram or Politics

Modify pictures thoroughly in Google Photos

From Image to Video Generation 🎥

Synthesize any content with ever increasing quality

Artificial Intelligence

(Converging) Subfields

  • Natural Language Processing (NLP)
  • Computer Vision (CV)
  • Robotics 🤖

How does Computer Intelligence work?

  • concepts often used interchangeably (though not identical)
    • Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL)
  • learn patterns from lots of data
    • more recycling than genuine intelligence
    • theory-agnostic
  • supervised learning is the most popular approach
    • learn the relation between input and output
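
The supervised idea — learning the relation between input and output from examples — can be sketched in plain Python with an invented toy dataset. Here the “model” merely counts which words co-occur with which label; real systems use far more sophisticated statistics, but the principle is the same:

```python
# A minimal, from-scratch sketch of supervised learning (toy data, invented
# for illustration): learn word-label associations, then label new text.
from collections import Counter

train = [
    ("great film, loved it", "positive"),
    ("wonderful acting", "positive"),
    ("boring plot, waste of time", "negative"),
    ("terrible movie", "negative"),
]

# "Training": count how often each word appears with each label.
word_labels = Counter()
for text, label in train:
    for word in text.lower().replace(",", "").split():
        word_labels[(word, label)] += 1

def predict(text):
    """Score each label by how often its words appeared with that label."""
    scores = Counter()
    for word in text.lower().split():
        for label in ("positive", "negative"):
            scores[label] += word_labels[(word, label)]
    return scores.most_common(1)[0][0]

print(predict("what a wonderful film"))
```

Note that the prediction rests on learned word associations, not on any understanding of film reviews — “more recycling than genuine intelligence”.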

AI is also hype 📣

AI = from humankind import solution

AI is different from Human Intelligence

Why this matters for
Social Science

Computational Social Science

data-driven research

Group discussion

What kind of data is there?

What data is relevant for social science?

  • data as traces of social behaviour
    • tabular, text, image
  • datafication
    • smartphone sensors, digital communication
  • much of human knowledge compiled as text

About the mystery of coding

coding is like…

  • cooking with recipes
  • having superpowers

Women have coding powers too!

Where the actual revolution is

Coding is a superpower

  • flexible
  • reusable
  • reproducible
  • inspectable
  • collaborative

… to tackle complex problems at scale

About us

Personal example

directed country mentions in UN speeches
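
The actual study involves more careful methods; as a toy illustration of the underlying idea, counting which speaker mentions which country could be sketched like this (all speakers, snippets, and country names here are invented):

```python
# Toy sketch: tally directed country mentions (speaker -> mentioned country)
# via simple substring matching. Real analyses need proper entity recognition.
from collections import Counter

countries = ["France", "Germany", "China", "Brazil"]

speeches = {
    "Switzerland": "We commend France and Germany for their efforts.",
    "Argentina": "Brazil remains our closest partner; we also thank France.",
}

mentions = Counter()
for speaker, text in speeches.items():
    for country in countries:
        if country in text:
            mentions[(speaker, country)] += 1

for (speaker, country), n in sorted(mentions.items()):
    print(f"{speaker} -> {country}: {n}")
```

Scaled up to thousands of real speeches, such directed counts reveal who talks about whom, and how often.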

Goals of this course 🎯

What you learn

  • collect and curate data
  • computationally analyze, interpret, and visualize texts
  • digital literacy + scholarship
  • problem-solving capacity

Learnings from previous courses

  • too much content, too little practice
  • programming can be overwhelming
  • learning by doing, doing by googling (ChatGPT?!)

Levels of proficiency

  1. awareness of today’s computational potential
  2. analyzing existing datasets
  3. creating + analyzing new datasets
  4. applying advanced machine learning

How I teach

  • computational practices
  • critical perspective on technology
  • lecture-style introductions
  • hands-on coding sessions
  • discussions + experiments in groups

Provisional schedule

Date Topic
22 February 2024 Introduction + Where is the digital revolution?
29 February 2024 Text as Data
07 March 2024 Setting up your Development Environment
14 March 2024 Introduction to the Command-line
21 March 2024 Basic NLP with Command-line
28 March 2024 Introduction to Python in VS Code
04 April 2024 no lecture (Osterpause)
11 April 2024 Working with (your own) Data
18 April 2024 Data Analysis of Swiss Media
25 April 2024 Ethics and the Evolution of NLP
02 May 2024 NLP with Python
09 May 2024 no lecture (Christi Himmelfahrt)
16 May 2024 NLP with Python II + Working Session
23 May 2024 Mini-Project Presentations + Discussion
30 May 2024 no lecture (Fronleichnam)

🖥️ There will be two digital lectures via Zoom.

TL;DR 🚀

You will be tech-savvy…

…yet not a programmer applying fancy machine learning

Requirements

  • no technical skills required
    • self-contained course
  • laptop (macOS, Win11, Linux) 💻
    • update system
    • free up at least 15 GB of storage
    • backup files

Grading ✍️

  • 2 assignments during semester
    • no grades (pass/fail)
  • mini-project with presentation
    • backup claims with numbers
    • work in teams
    • data of your interest
  • optional: writing a seminar paper
    • in cooperation with Prof. Sophie Mützel

Organization

  • seminar on Thursdays from 2:15 to 4:00 pm
    • additionally, streaming via Zoom
  • course website KED2024 with slides + information
  • readings on OLAT
  • communication on OLAT Forum

Registration via UniPortal

In order to acquire credits for this course, registration via UniPortal within the registration period is mandatory.

🚨 Registration period: 5th February – 1st March 2024

Assignment #1 ✍️

  • get/submit via OLAT
    • starting tonight
    • deadline: 1 March 2024, 23:59
  • discuss issues on OLAT forum

Course Website

Questions?

References

Bar-Tal, Omer, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, et al. 2024. “Lumiere: A Space-Time Diffusion Model for Video Generation.” January 23, 2024. https://doi.org/10.48550/arXiv.2401.12945.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Virtual Event Canada: ACM. https://doi.org/10.1145/3442188.3445922.
Brooks, Tim, Bill Peebles, Connor Homes, Will DePue, Yufei Guo, Li Jing, David Schnurr, et al. 2024. “Video Generation Models as World Simulators.” https://openai.com/research/video-generation-models-as-world-simulators.
Chang, Huiwen, Han Zhang, Jarred Barber, A. J. Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, et al. 2023. “Muse: Text-To-Image Generation via Masked Generative Transformers.” January 2, 2023. https://doi.org/10.48550/arXiv.2301.00704.
Duquenne, Paul-Ambroise, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Hirofumi Inaguma, Christopher Klaiber, et al. 2023. “Multilingual Expressive and Streaming Speech Translation.”
Graham, Shawn, Ian Milligan, and Scott Weingart. 2015. Exploring Big Historical Data: The Historian’s Macroscope. Open Draft Version. Under contract with Imperial College Press. http://themacroscope.org.
Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. “Analyzing and Improving the Image Quality of StyleGAN.” March 23, 2020. https://doi.org/10.48550/arXiv.1912.04958.
Lazer, David, Eszter Hargittai, Deen Freelon, Sandra Gonzalez-Bailon, Kevin Munger, Katherine Ognyanova, and Jason Radford. 2021. “Meaningful Measures of Human Society in the Twenty-First Century.” Nature 595 (7866, 7866): 189–96. https://doi.org/10.1038/s41586-021-03660-7.
Lazer, David, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, et al. 2009. “Computational Social Science.” Science 323 (5915): 721–23. https://doi.org/10.1126/science.1167742.
Lundberg, Ian, Jennie E. Brand, and Nanum Jeon. 2022. “Researcher Reasoning Meets Computational Capacity: Machine Learning for Social Science.” Social Science Research 108 (November): 102807. https://doi.org/10.1016/j.ssresearch.2022.102807.
Plüss, Michel, Lukas Neukom, Christian Scheller, and Manfred Vogel. 2021. “Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus.” June 9, 2021. https://doi.org/10.48550/arXiv.2010.02810.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” December 6, 2022. https://doi.org/10.48550/arXiv.2212.04356.
Salganik, Matthew J. 2017. Bit by Bit: Social Research in the Digital Age. Illustrated edition. Princeton: Princeton University Press. https://www.bitbybitbook.com.
Sheynin, Shelly, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. 2023. “Emu Edit: Precise Image Editing via Recognition and Generation Tasks.”
Timiryasov, Inar, and Jean-Loup Tastet. 2023. “Baby Llama: Knowledge Distillation from an Ensemble of Teachers Trained on a Small Dataset with No Performance Penalty.” In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, 251–61. Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.conll-babylm.24.
Wang, Chengyi, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, et al. 2023. “Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers.” January 5, 2023. https://doi.org/10.48550/arXiv.2301.02111.
Wu, Yuxin, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. “Detectron2.” Meta Research. https://github.com/facebookresearch/detectron2.